Take-home Ex01

Author

Liang Xiuhao Rydia

Published

September 2, 2024

Modified

September 24, 2024

1.0 Introduction & Objectives

In this Take-Home Exercise 1, we will be discovering factors affecting road traffic accidents in the Bangkok Metropolitan Region BMR by employing both spatial and spatio-temporal point patterns analysis methods.

The specific objectives of this take-home exercise are as follows:

  • To visualize the spatio-temporal dynamics of road traffic accidents in BMR using appropriate statistical graphics and geovisualization methods.

  • To conduct detailed spatial analysis of road traffic accidents using appropriate Network Spatial Point Patterns Analysis methods.

  • To conduct detailed spatio-temporal analysis of road traffic accidents using appropriate Temporal Network Spatial Point Patterns Analysis methods.

1.1 The Study Area

We will be using Geodetic CRS WGS 84 as it is a global standard. It leaves us with the last three options. As will click into each result, we will observe that WGS 84 / PDC Mercator covers wide area of use, and perhaps, more suitable for seafare. Whereas for WGS 84 / UTM zone 47N and WGS 84 / UTM zone 48N, the area are more precise, defined by the Easting and Northing.

WGS 84 / UTM zone 47N - EPSG 32647 WGS 84 / UTM zone 48N - EPSG 32648
Area of use: Between 96°E and 102°E, northern hemisphere between equator and 84°N, onshore and offshore. China. Indonesia. Laos. Malaysia - West Malaysia. Mongolia. Myanmar (Burma). Russian Federation. Thailand. Area of use: Between 102°E and 108°E, northern hemisphere between equator and 84°N, onshore and offshore. Cambodia. China. Indonesia. Laos. Malaysia - West Malaysia. Mongolia. Russian Federation. Singapore. Thailand. Vietnam.
Projected CRS

Considering Bangkok’s Coordinate (13.7563° N, 100.5018° E), we will use WGS 84 / UTM zone 47N - EPSG 32647, as its Northing of 13.7563° falls between the Equator and 84°N, and its Easting of 100.5018° falls between 96°E and 102°E, which are both in the area of use as indicated in the table above.

1.2 The Datasets

The following datasets are provided as part of the Take Home Exercise 1:

Aspatial

  • Thailand Road Accident [2019-2022] on Kaggle

    • This dataset is a list of Thailand Road Accident between 2019 to 2022 in .csv format, where details like province, date/time of accident, cause of accident, type of vehicles and weather conditions are provided. The variables in this dataset will be used for analysis of the factors that contributes to the road traffic accidents.

Geospatial

2.0 Setting up the Environment

2.1 Installing and loading the required libraries

The code chunk below checks if the packages are installed. If the packages are not yet installed, it will proceed to install and subsequently load the libraries. If the packages are already installed, it will proceed to launch into the R environment.

pacman::p_load(knitr,
               leaflet,
               lubridate,
               raster,
               sf,
               spatstat,
               spNetwork,
               tidyverse,
               tmap)
Packages Description
knitr For dynamic report generation
leaflet For interactive map
lubridate Functions to work with date-times and time-spans
raster Reading, writing, manipulating, analyzing and modeling of spatial data. 
sf For importing, managing, ad handling geospatial data
spatstat For Spatial Point Pattern Analysis (SPPA)
spNetwork Perform spatial analysis on network
tidyverse For aspatial data wrangling
tmap For thematic mapping

2.2 Setting Seeds to Ensure Reproducibility

The set. seed() function is used to set a Random seed which Pseudorandom number generators use when generating “random” numbers. By using this function, we ensure that the randomly generated numbers remain the same when the code are reproduced.

set.seed(12345)

3.0 Importing and Wrangling the Data

3.1 Importing Aspatial Data and Converting it into Spatial Data

Importing the data without filtering:

acc <- read_csv("data/rawdata/thai_road_accident_2019_2022.csv")

Importing the data, and conducting necessary filter, conversion to sf, transformation of CRS, and get the days component of the incident_datetime:

acc_sf <- read_csv("data/rawdata/thai_road_accident_2019_2022.csv") %>%   
  filter(!is.na(longitude) &longitude != "",!is.na(latitude) & latitude != "") %>%    
  st_as_sf(coords = c("longitude", "latitude"),            crs = 4326) %>%    
  st_transform(crs = 32647) %>% 
  mutate(Day = day(incident_datetime)) %>% 
  mutate(Month = month(incident_datetime,
                       label = TRUE,
                       abbr = TRUE)) %>% 
  mutate(Year = year(incident_datetime)) %>% 
  mutate(DaysOfWeek = wday(incident_datetime,
                           week_start = 1)) %>% 
  dplyr::select(c(2,5,8:21))
Thailand Road Accident [ 2019-2022]
  • read_csv() of readr package to import the data in .csv format as tibble dataframe.
Before proceeding to filter the data, we observe that there is a total of 81,735 observations.

-   ![](images/clipboard-3840017767.png)
  • dplyr::filter() to filter out rows that has “na” or is empty in value.

    • After filtering “na” and empty values, we are left with 81,376 observations. This means that we have lost about 0.44% of the data. This will not affect our analysis as 0.44% a small proportion of the total number of observations. (Rule of Thumb, not >5% lost)

  • st_as_sf(coords = c(“longitude”, “latitude”), crs = 4326) combines and longitude and latitude columns into geometry column.

    • before applying st_as_sf() function, we observe that the latitude and longitude are in decimal degrees, therefore, we assume it is WGS84 datum, with the EPSG code of 4326.

    • Notice that the number of variables changed from 18 to 17. This is because st_as_sf() function has combined the longitude and latitude columns in the original dataset into one column name geometry. The columns named longitude and latitude are no longer found in the data.

  • st_transform() to change the Coordinate Reference System (CRS) to the correct EPSG code of 32647.

  • lubridate() is used to wrangle the incident_datetime, which is in datetime format of POSIXct.

    • lubridate::month(): label = TRUE -> change it into factor. If we do not use label = TRUE, it will be sorted using alphabetical logic. If it is a factor, it will be sorted according to date/month logic from Jan to Dec.

    • The columns of Day, Month, Year, and DaysOfWeek are created.

  • Use dplyr::select() to select the relevant columns to retain.

  • The output R object is called acc_sf and it is a sf data frame.

After the wrangling the data, we will save them for future use, using the write_rds() of readr package. The step will help us save time of re-running the codes for importing and wrangling the raw data.

write_rds(acc_sf, "data/rds/acc_sf.rds")

We can import the data using the read_rds() of the readr package:

acc_sf <- read_rds("data/rds/acc_sf.rds")

3.2 Importing Spatial Data

3.2.1 Importing the Administrative Boundaries

Importing the Thailand - Subnational Administrative Boundaries using st_read() of sf package. Since we are only interested in the Bangkok Metropolitan Region (BMR) as our study area, we will extract only the six provinces by using filter(). We will use the layer name “tha_admbnda_adm1_rtsd_20220121” as ADM1 refers to the province level boundaries. Note: Thailand administrative level 0 (country), 1 (province), 2 (district), and 3 (sub-district, tambon) boundaries.

sab_P <- st_read(dsn = "data/rawdata",
               layer = "tha_admbnda_adm1_rtsd_20220121") %>%
  filter(ADM1_EN %in% c("Bangkok","Nakhon Pathom", "Pathum Thani", "Nonthaburi", "Samut Prakan", "Samut Sakhon")) %>% 
  dplyr::select(3,17) %>% 
  st_transform(crs = 32647) 

write_rds(sab_P, "data/rds/sab_P.rds")

We will use the following code chunk to take a quick look at selected BMR boundaries.

sab_P <- read_rds("data/rds/sab_P.rds")
plot(st_geometry(sab_P))

Since we are not very familiar with the Thailand map, we may want to double check the correctness by including the label with name of the six province and the coordinates.

# setting the colors for each Province
colors = c("Bangkok" = "#FF0000",
           "Nakhon Pathom" = "#55FF00",
           "Pathum Thani" = "#FFAA00",
           "Nonthaburi" = "#0000FF",
           "Samut Prakan" = "#7F00FF",
           "Samut Sakhon" = "#00FFFF")

ggplot(data = sab_P) +
  geom_sf(aes(fill = ADM1_EN)) + 
  scale_fill_manual(values = colors, 
                    name = "Province") +
  geom_text(aes(label = ADM1_EN,
                geometry = geometry), 
            stat = "sf_coordinates", 
            size = 3)

If we wish to go to the district-level granularity, we may choose to import the ADM2 layer instead.

sab_C <- st_read(dsn = "data/rawdata",
                 layer = "tha_admbnda_adm2_rtsd_20220121") %>%  
  filter(ADM1_EN %in% c("Bangkok","Nakhon Pathom", "Pathum Thani", "Nonthaburi", "Samut Prakan", "Samut Sakhon")) %>% 
  dplyr::select(3,11,20) %>% 
  st_transform(crs = 32647) 

write_rds(sab_C, "data/rds/sab_C.rds")

Similarly, we will use ggplot2() to visualise the using the Level 2 - District (Amphoe) boundaries.

sab_C <- read_rds("data/rds/sab_C.rds")
ggplot(data = sab_C) +
  geom_sf(aes(fill = ADM1_EN)) +
  scale_fill_manual(values = colors, 
                    name = "Province")

Boundaries
  • For creating the boundaries for the study area, ADM1_EN would be sufficient.

  • However, we may want to have ADM2_EN for detailed study for each districts within the BMR.

3.2.2 Creating owin object for BMR boundaries

bmr_owin <- as.owin(sab_P)
write_rds(bmr_owin, "data/rds/bmr_owin.rds")
bmr_owin <- read_rds("data/rds/bmr_owin.rds")
plot(bmr_owin)

summary(bmr_owin)
Window: polygonal boundary
single connected closed polygon with 13779 vertices
enclosing rectangle: [587893.5, 712440.5] x [1484413.7, 1579076.3] units
                     (124500 x 94660 units)
Window area = 7668990000 square units
Fraction of frame area: 0.65

Window Area / Size of the Area

Important
  • Window Area = 7,668,990,000 square units (m2)

    • converting to km2 by dividing it by 1,000,000.

    • Window Area of BMR = ~7,668.99 km2

3.2.3 Importing the Road Networks

In this code chunk, we will import the Thailand Roads (OpenStreetMap Export) downloaded from HDX.

road <- st_read(dsn ="data/rawdata",
                 layer = "hotosm_tha_roads_lines_shp") 

Assigning EPSG code 4326 as the geometry shows longlat decimal degree between (-180 to 180). Next, we use st_transform() to re-project from Geographic CRS of WGS84 to Projected CRS of UTM Zone 47N. After exploring the data, we can see that most of the accidents occur on the four types of highway: “motorway”, “trunk”, “primary”,“secondary”.

roads_clean <- road %>% 
  filter(highway %in% c("motorway", "trunk", "primary","secondary")) %>% 
  dplyr::select(2:4,15) %>% 
  st_set_crs(4326) %>%
  st_transform(crs = 32647)

Saving road_clean as .rds:

write_rds(roads_clean, "data/rds/roads_clean.rds")
sab_P <- read_rds("data/rds/sab_P.rds")
roads_clean <- read_rds("data/rds/roads_clean.rds")

roads_bmr <- st_intersection(sab_P,roads_clean)

Saving the roads network in BMR:

write_rds(roads_bmr, "data/rds/roads_bmr.rds")
roads_bmr <- read_rds("data/rds/roads_bmr.rds")

plot(st_geometry(roads_bmr))
plot(st_geometry(sab_P), add = T)

3.2.5 Extracting the accidents records within BMR

acc_sf <- read_rds("data/rds/acc_sf.rds")

acc_bmr <- st_intersection(sab_P,acc_sf)

Saving the accidents records within BMR:

write_rds(acc_bmr,"data/rds/acc_bmr.rds")

3.3 Visualising the Geospatial Data

Before jumping into the analysis, we will use plot(), ggplot() and tmap() to take a quick look.

sab_C <- read_rds("data/rds/sab_C.rds")
acc_bmr <-read_rds("data/rds/acc_bmr.rds")

plot(st_geometry(roads_bmr),
     col = "grey")
plot(acc_bmr, add = T, 
     col = "red", 
     pch = 19,
     cex = 0.1)
plot(st_geometry(sab_C), add = T)

highway_color = c("motorway" = "#FF00FF",
           "trunk" = "#00FF00",
           "primary" = "#FFFF00",
           "secondary" = "#0000FF",
           "tertiary" = "#33AABB",
           "unclassified" = "#5500FF")
  
ggplot() +
  geom_sf(data = sab_C, 
          color = "black",
          size = 1.0) +
  geom_sf(data = roads_bmr, 
          aes(color = highway), 
          size = 1) + 
  geom_sf(data = acc_bmr, 
          color = "red", 
          size = 0.1) + 
  labs(title = "Road Network and Accident Points", 
       x = "Longitude", 
       y = "Latitude") +
  scale_color_manual(values = highway_color) 

tmap_mode('view')

tm_shape(acc_bmr) + 
  tm_dots(col ="red",
          alpha = 0.5,
          size = 0.1) +
  tm_shape(roads_bmr) +
  tm_lines(col = "highway", 
           palette = highway_color,  
           lwd = 2)
tmap_mode("plot")

Using tmap’s view mode to explore, we are able to see that the motorway, primary, secondary and trunk highway types have higher accident records. There is significantly lesser incidents along tertiary and unclassified road.

Since most of the accidents happens in just four categories mentioned, we had filter the accident data for those that happened in the relevant highway types, so that we can save on computation time.

Instead of using tmap to physically track where are the accidents happening, we can also choose to extract these information from the “route” column in acc_bmr.

4.0 Network Constrained Spatial Point Pattern Analysis

4.1 Preparing the lixel objects

Before computing NKDE, the SpatialLines object need to be cut into lixels with a specified minimal distance. This task can be performed by using with lixelize_lines() of spNetwork as shown in the code chunk below.

As there some of the geometry in roads_bmr are “MULTILINESTRING”, we will not be able to process it using lixelize_lines() directly. lixelize_lines() requires the all geometries to be “LINESTRING”, thus, we will use st_cast() to cast all the “MULTILINESTRING” into “LINESTRING” geometry.

roads_ls <- st_cast(roads_bmr, "LINESTRING")
write_rds(roads_ls, "data/rds/roads_ls.rds")
Choosing the cell size
  • recall that Window Area of BMR = ~7,668.99 km2

  • we will set the length of each lixel to be 500, which is equals to 500m on the map

lixels <- lixelize_lines(roads_ls, 
                         5000, 
                         mindist = 3000)
write_rds(lixels,"data/rds/lixels")

4.2 Generating line centre points

Next, lines_center() of spNetwork will be used to generate a SpatialPointsDataFrame (i.e. samples) with line centre points as shown in the code chunk below.

samples <- lines_center(lixels) 
write_rds(samples, "data/rds/samples.rds")

4.3 Performing NKDE

NKDE
  • We will use “Simple” method for quick visualisation due to large dataset.
  • We will set bandwidth (bw) as 10,000, which is equivalent to 10km
densities <- nkde(roads_ls, 
                  events = acc_bmr,
                  w = rep(1, nrow(acc_bmr)),
                  samples = samples,
                  kernel_name = "quartic",
                  bw = 20000,
                  div= "bw", 
                  method = "simple", 
                  digits = 1, 
                  tol = 1,
                  grid_shape = c(1,1), 
                  max_depth = 8,
                  agg = 100, 
                  sparse = TRUE,
                  verbose = FALSE)
write_rds(densities,"data/rds/densities.rds")

4.3.1 Visualising NKDE

densities <- read_rds("data/rds/densities.rds")

samples$density <- densities
lixels$density <- densities

Rescaling to per 1 kilometers

samples$density <- samples$density*1000
lixels$density <- lixels$density*1000

Using tmap for visualisation

tmap_mode('view')
tm_shape(lixels)+
  tm_lines(col="density")+
tm_shape(acc_bmr)+
  tm_dots()
tmap_mode('plot')
Hotspots
  • Hotspot 1 is at the highway along Suvarnabhumi International Airport.

    • This is one of the busiest areas due to heavy traffic associated with the airport. Possible accident due to tourists who are driving in unfamiliar places.
  • Hotspot 2 is along highway 338, Borommaratchachonnani Rd.

    • This highway is a major route leading out of Bangkok.
  • Hotspot 3 is along highway 8, Kanchanaphisek Road.

    • This highway is a major route leading out of Bangkok.

lixels = 5,000, mindist = 3,000.

bw = 10,000

4.4 Network Constrained G- and K-Function Analysis

In this section, we are going to perform complete spatial randomness (CSR) test by using kfunctions() of spNetwork package. The null hypothesis is defined as:

Ho: The observed spatial point events (i.e distribution of road accidents) are uniformly distributed over a street network in BMR.

The CSR test is based on the assumption of the binomial point process which implies the hypothesis that the road accidents are randomly and independently distributed over the street network.

If this hypothesis is rejected, we may infer that the distribution of road accidents are spatially interacting and dependent on each other; as a result, they may form nonrandom patterns.

kfun_acc <- kfunctions(roads_ls, 
                       acc_bmr,
                       start = 0, 
                       end = 1000000, 
                       step = 10000, 
                       width = 10000, 
                       nsim = 50, 
                       resolution = 10,
                       verbose = FALSE, 
                       conf_int = 0.05,
                       agg = 1000)

5.0 Temporal Network Kernel Density Estimate (TNKDE)

Events recorded on a network often have a temporal dimension. In that context, one could estimate the density of events in both time and network spaces.

The spatio-temporal kernel is calculated as the product of the network kernel density and the time kernel density.

5.1 Temporal Dimension

acc_bmr <- read_rds("data/rds/acc_bmr.rds")

# converting the Date field to a numeric field (counting days)
acc_bmr$Time <- as.POSIXct(acc_bmr$incident_datetime, format = "%Y/%m/%d")
start <- as.POSIXct("2019/01/01", format = "%Y/%m/%d")

acc_bmr$Time <- difftime(acc_bmr$Time, start, units = "days")
acc_bmr$Time <- as.numeric(acc_bmr$Time)

years <- as.character(2019:2022)
months <- as.character(1:12)
months <- ifelse(nchar(months) == 1, paste0("0", months), months)

months_starts_labs <- expand.grid(years, months)
months_starts_labs <- paste(months_starts_labs$Var1, "/", months_starts_labs$Var2, "/01", sep = "")

months_starts_num <- as.POSIXct(months_starts_labs, format = "%Y/%m/%d")
months_starts_num <- difftime(months_starts_num, start, units = "days")
months_starts_num <- as.numeric(months_starts_num)
months_starts_labs <- gsub("/01","", months_starts_labs)

ggplot(acc_bmr) + 
  geom_histogram(aes(x = Time), bins = 30, color = "white") + 
  scale_x_continuous(breaks = months_starts_num, labels = months_starts_labs) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

5.1.1 Calculate the kernel density values in time for several bandwidths

w <- rep(1,nrow(acc_bmr))
samples <- seq(0, max(acc_bmr$Time), 0.5)

time_kernel_values <- data.frame(
  bw_10 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 10, kernel_name = "quartic"),
  bw_20 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 20, kernel_name = "quartic"),
  bw_30 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 30, kernel_name = "quartic"),
  bw_40 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 40, kernel_name = "quartic"),
  bw_50 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 50, kernel_name = "quartic"),
  bw_60 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 60, kernel_name = "quartic"),
  bw_70 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 70, kernel_name = "quartic"),
  bw_80 = tkde(acc_bmr$Time, w = w, samples = samples, bw = 80, kernel_name = "quartic"),
  time = samples
)

df_time <- reshape2::melt(time_kernel_values,id.vars = "time")
df_time$variable <- as.factor(df_time$variable)

ggplot(data = df_time) + 
  geom_line(aes(x = time, y = value)) +
  scale_x_continuous(breaks = months_starts_num, labels = months_starts_labs) +
  facet_wrap(vars(variable), ncol=2, scales = "free") + 
  theme(axis.text = element_text(size = 5, angle = 90))

Intepreting the tkde()
  • Cyclical pattern - Higher number of road traffic incidents in December period, which is the start of Thailand’s peak tourist season.

  • Songkran festival - between 2019 and 2022, we can observe peak in Apr for year 2019, 2021 and 2022 using the bw_10 and bw_20 time series charts. In year 2020, Songkran was officially postponed due to COVID-19.

  • Lowest through - around April-July 2021 (bw_30 to bw_80), which coincides with Thailand’s ban on all passengers flights landing in Thailand till July 2021 due to third wave of COVID-19 in the country.

Source: https://ourworldindata.org/coronavirus/country/thailand

5.2 Spatial Dimension

Before considering the spatio-temporal case, we can also investigate the spatial dimension.

roads_bmr <- read_rds("data/rds/roads_bmr.rds")
tm_shape(roads_bmr) + 
  tm_lines(col = "black") + 
  tm_shape(acc_bmr) + 
  tm_dots(col = "red", size = 0.1)
densities <- read_rds("data/rds/densities.rds")

samples$density <- densities
lixels$density <- densities
samples$density <- samples$density*1000
lixels$density <- lixels$density*1000

Plotting with tmap:

tmap_mode('view')
tm_shape(lixels)+
  tm_lines(col="density")+
tm_shape(acc_bmr)+
  tm_dots()
tmap_mode('plot')

5.3 Spatio-Temporal

cv_scores <- bw_tnkde_cv_likelihood_calc(
  bw_net_range = c(100,500),
  bw_net_step = 100,
  bw_time_range = c(10,80),
  bw_time_step = 10,
  lines = roads_bmr,
  events = acc_bmr,
  time_field = "Time",
  w = rep(1, nrow(acc_bmr)),
  kernel_name = "quartic",
  method = "simple",
  diggle_correction = FALSE,
  study_area = NULL,
  max_depth = 10,
  digits = 2,
  tol = 0.1,
  agg = 10,
  sparse=TRUE,
  grid_shape=c(1,1),
  sub_sample=1,
  verbose = FALSE,
  check = TRUE)
knitr::kable(cv_scores)

References

News: https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries